Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System
Authors
Abstract
Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study four applications (the Grid-Based Projector-Augmented Wave application, the Car-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application) and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
Related resources
Non-data-communication Overheads in MPI: Analysis on Blue Gene/P
Modern HEC systems, such as Blue Gene/P, rely on achieving high performance by using the parallelism of a massive number of low-frequency/low-power processing cores. This means that the local pre- and post-communication processing required by the MPI stack might not be very fast, owing to the slow processing cores. Similarly, small amounts of serialization within the MPI stack that were acceptabl...
The Importance of Non-Data-Communication Overheads in MPI
With processor speeds no longer doubling every 18-24 months owing to the exponential increase in power consumption and heat dissipation, modern HEC systems tend to rely less on the performance of single processing units. Instead, they rely on achieving high performance by using the parallelism of a massive number of low-frequency/low-power processing cores. Using such low-frequency cores, how...
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
Asynchronous Two-level Checkpointing Scheme for Large-scale Adjoints in the Spectral-element Solver Nek5000
Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discreti...
Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing
In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, with an increase in the number of systems’ components and their interactions, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues tha...
Journal: IJHPCA
Volume: 25
Publication date: 2011